Abstract

The versatility of exponential families, along with their attendant convexity properties, makes
them a popular and effective statistical model. A central issue is learning these models in high
dimensions, such as when there is some sparsity pattern of the optimal parameter. This work
characterizes a certain strong convexity property of general exponential families, which allows
their generalization ability to be quantified. In particular, we show how this property can be used
to analyze generic exponential families under $L_1$ regularization.

1 Introduction 

Exponential models are perhaps the most versatile and pragmatic statistical models for a variety of
reasons: modelling flexibility (encompassing discrete variables, continuous variables, covariance
matrices, time series, graphical models, etc.); convexity properties allowing ease of optimization;
and robust generalization ability. A principal issue for applicability to large scale problems is
estimating these models when the ambient dimension of the parameters, $p$, is much larger than the
sample size $n$: the "$p \gg n$" regime.

Much recent work has focused on this problem in the special case of linear regression in high
dimensions, where it is assumed that the optimal parameter vector is sparse (e.g. Zhao and Yu [2006],
Candes and Tao [2007], Meinshausen and Yu [2009], Bickel et al. [2008]). This body of prior work
focused on: sharply characterizing the convergence rates for the prediction loss; consistent model
selection; and obtaining sparse models. As we tackle more challenging problems, there is a growing
need for model selection in more general exponential families. Recent work here includes learning
Gaussian graphs (Ravikumar et al. [2008b]) and Ising models (Ravikumar et al. [2008a]).

Classical results established that consistent estimation in general exponential families is possible
in the asymptotic limit where the number of dimensions is held constant (though some work establishes
rates under certain conditions as $p$ is allowed to grow slowly with $n$; see Portnoy [1988],
Ghosal [2000]). However, in modern problems, we typically grow $p$ rapidly with $n$ (so even
asymptotically we are often interested in the regime where $p \gg n$, as in the case of sparse
estimation). While we have a handle on this question for a variety of special cases, a pressing
question here is understanding how fast $p$ can scale as a function of $n$ in general exponential
families. Such an analysis must quantify the relevant aspects of the particular family at hand which
govern its convergence rate. This is the focus of this work. We should emphasize that throughout this
paper, while we are interested in modelling with an exponential family, we are agnostic about the
true underlying distribution (e.g. we do not necessarily assume that the data generating process is
from an exponential family).

Our Contributions and Related Work The key issue in analyzing the convergence rates of expo- 
nential families in terms of their prediction loss (which we take to be the log loss) is in characterizing 
the nature in which they are strictly convex — roughly speaking, in the asymptotic regime where 
we have a large sample size n (with p kept fixed), we have a central limit theorem effect where the 
log loss of any exponential family approaches the log loss of a Gaussian, with a covariance matrix 
corresponding to the Fisher information matrix. Our first main contribution is quantifying the rate 
at which this effect occurs in general exponential families. 

In particular, we show that every exponential family satisfies a certain rather natural growth rate
condition on its standardized moments and standardized cumulants (recall that the $k$-th standardized
moment is the unitless ratio of the $k$-th central moment to the $k$-th power of the standard
deviation, which for $k = 3, 4$ is the skew and kurtosis). This condition is rather mild, in that
these moments can grow as fast as $k!$. Interestingly, similar conditions have been well studied for
obtaining exponential tail bounds for the convergence of a random variable to its mean (Bernstein
[1946]). We show that this growth rate characterizes the rate at which the prediction loss of the
exponential family behaves as a strongly convex loss function. In particular, our analysis draws many
parallels to the analysis of Newton's method, where there is a "burn in" phase in which a number of
iterations must occur until the function behaves as a locally quadratic function. In our statistical
setting, we now require a (quantified) "burn in" sample size: beyond this threshold sample size, the
prediction loss inherits the desired strong convexity properties (i.e. it is locally quadratic).

Our second contribution is an analysis of $L_1$ regularization in generic families, in terms of both
prediction loss and the sparsity level of the selected model. Under a particular sparse eigenvalue
condition on the design matrix (the Restricted Eigenvalue (RE) condition in Bickel et al. [2008]), we
show how $L_1$ regularization in general exponential families enjoys a convergence rate of
$O(\frac{s \log p}{n})$ (where $s$ is the number of relevant features). This RE condition is one of
the least stringent conditions which permit this optimal convergence rate for the linear regression
case (see Bickel et al. [2008]); the stronger mutual incoherence/irrepresentable conditions
considered in Zhao and Yu [2006] also provide this rate. We show that an essentially identical
convergence rate can be achieved for general exponential families. Our results are non-asymptotic and
precisely relate $n$ and $p$.

Our final contribution is one of approximate sparse model selection, i.e. where our goal is to obtain
a sparse model with low prediction loss. A drawback of the RE condition in comparison to the
mutual incoherence condition is that the latter permits perfect recovery of the true features (at the
price of a more stringent condition). However, for the case of linear regression, Zhao and Yu [2006]
and Bickel et al. [2008] show that, under a sparse eigenvalue or RE condition, the $L_1$ solution is
actually sparse itself (with a multiplicative increase in the sparsity level that depends on a
certain condition number of the design matrix). So while the $L_1$ solution may not precisely recover
the true model, it is still sparse (with some multiplicative increase) and does recover those
features with large true weights.

For general exponential families, while we do not have a characterization of the sparsity level of
the $L_1$-regularized solution (an interesting open question), we do provide a simple two stage
procedure (thresholding and refitting) which yields a sparse model, with support on no more than
$2s$ features, that has nearly as good performance (with a rather mild increase in the risk). This
result is novel even for the square loss case. Hence, even under the rather mild RE condition, we can
obtain both a favorable convergence rate and a sparse model for generic families.

2 The Setting 

Our samples $t \in \mathbb{R}^p$ are distributed independently according to $D$, and we model the
process with $P(t|\theta)$, where $\theta \in \Theta$. However, we do not necessarily assume that $D$
lies in this model class. The class of interest is exponential families, which, in their natural
form, we denote by:

$$P(t|\theta) = h_t \exp\{\langle \theta, t \rangle - \log Z(\theta)\}$$

where $t$ is the natural sufficient statistic for $\theta$, and $Z(\theta)$ is the partition function.
Here, $\Theta$ is the natural parameter space: the (convex) set where $Z(\cdot)$ is finite. While we
work with an exponential family in this general (though natural) form, it should be kept in mind that
$t$ can be the sufficient statistic for some prediction variable $y$ of interest, or, for a
generalized linear model (such as for logistic or linear regression), we can have $t$ be a function of
both $y$ and some covariate $x$ (see Dobson [1990]). We return to this point later.
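As a concrete illustration of the natural form above, the following minimal sketch evaluates $\log P(t|\theta)$ for a family on a finite sample space, where the partition function is a finite sum. The function names and the toy family (a Bernoulli variable with $h_t = 1$) are assumptions made for illustration only; they are not part of the paper.

```python
import math

def log_partition(theta, stats, base):
    """log Z(theta) = log sum_t h_t * exp(<theta, t>) over a finite sample space."""
    exponents = [math.log(h) + sum(a * b for a, b in zip(theta, t))
                 for t, h in zip(stats, base)]
    shift = max(exponents)  # log-sum-exp shift for numerical stability
    return shift + math.log(sum(math.exp(e - shift) for e in exponents))

def log_prob(theta, t, h, stats, base):
    """log P(t | theta) = log h_t + <theta, t> - log Z(theta)."""
    inner = sum(a * b for a, b in zip(theta, t))
    return math.log(h) + inner - log_partition(theta, stats, base)

# Hypothetical toy family: sufficient statistic t in {0, 1} with base measure h_t = 1.
stats = [(0.0,), (1.0,)]
base = [1.0, 1.0]
theta = (0.7,)
probs = [math.exp(log_prob(theta, t, h, stats, base)) for t, h in zip(stats, base)]
```

For this toy family the probabilities sum to one and the probability of $t = 1$ is the logistic function of $\theta$, which matches the Bernoulli example discussed later in the paper.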

Our prediction loss is the expected negative log-likelihood (the log loss) and $\theta^*$ is the
optimal parameter, i.e.

$$\mathcal{L}(\theta) = E_{t \sim D}[-\log P(t|\theta)], \qquad \theta^* = \mathrm{argmin}_{\theta}\, \mathcal{L}(\theta),$$

where the argmin is over the natural parameter space and it is assumed that this $\theta^*$ is an
interior point of this space. Later we consider the case where $\theta^*$ is sparse.

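We denote the Fisher information of $P(\cdot|\theta^*)$ as
$\mathcal{F}^* = E_{t \sim P(\cdot|\theta^*)}[-\nabla^2 \log P(t|\theta^*)]$, under the model of
$\theta^*$. The induced "Fisher risk" is

$$\|\theta - \theta^*\|^2_{\mathcal{F}^*} := (\theta - \theta^*)^\top \mathcal{F}^* (\theta - \theta^*).$$

We also consider the $L_1$ risk $\|\theta - \theta^*\|_1$.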

For a sufficiently large sample size, we expect the Fisher risk of an empirical minimizer
$\hat\theta$, $\|\hat\theta - \theta^*\|^2_{\mathcal{F}^*}$, to be close to
$\mathcal{L}(\hat\theta) - \mathcal{L}(\theta^*)$. One of our main contributions is quantifying when
this occurs in general exponential families. This characterization is then used to quantify the
convergence rate for $L_1$ methods in these families. We also expect this strong convexity property
to be useful for characterizing the performance of other regularization methods as well.

All proofs can be found in the appendix. 



3 (Almost) Strong Convexity of Exponential Families 

We first consider a certain bounded growth rate condition for standardized moments and standard- 
ized cumulants, satisfied by all exponential families. This growth rate is fundamental in establishing 
how fast the prediction loss behaves as a quadratic function. Interestingly, this growth rate is analo- 
gous to those conditions used for obtaining exponential tail bounds for arbitrary random variables. 



3.1 Analytic Standardized Moments and Cumulants 

Moments: For a univariate random variable $z$ distributed by $\rho$, let us denote its $k$-th central
moment (centered at the mean) by:

$$m_{k,\rho}(z) = E_{z \sim \rho}\left[(z - m_{1,\rho}(z))^k\right]$$

where $m_{1,\rho}(z)$ is the mean $E_{z \sim \rho}[z]$. Recall that the $k$-th standardized moment is
the ratio of the $k$-th central moment to the $k$-th power of the standard deviation, i.e.
$\frac{m_{k,\rho}(z)}{m_{2,\rho}(z)^{k/2}}$. This normalization with respect to standard deviation
makes the standardized moments unitless quantities. For $k = 3$ and $k = 4$, the standardized moments
are the skew and kurtosis.

We now define the analytic standardized moment for $z$. We use the term analytic to reflect that if
the moment generating function of $z$ is analytic (see footnote 1) then $z$ has an analytic moment.

Definition 3.1. Let $z$ be a univariate random variable under $\rho$. Then $z$ has an analytic
standardized moment of $\alpha$ if the standardized moments exist and are bounded as follows:

$$\forall k \ge 3, \quad \left| \frac{m_{k,\rho}(z)}{m_{2,\rho}(z)^{k/2}} \right| \le \frac{1}{2}\, k!\, \alpha^{k-2}$$

(where the above is assumed to hold if the denominator is 0). If $t \in \mathbb{R}^p$ is a
multivariate random variable distributed according to $\rho$, we say that $t$ has an analytic
standardized moment of $\alpha$ with respect to a subspace $V \subset \mathbb{R}^p$ (e.g. a set of
directions) if the above bound holds for all univariate $z = \langle v, t \rangle$ where $v \in V$.
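The bound in Definition 3.1 can be checked numerically for a finitely supported distribution, at least up to a finite order. The sketch below is an illustration under stated assumptions: the helper names and the three-point example distribution are hypothetical, and a finite `k_max` can only refute the bound, never prove it for all $k$.

```python
import math

def central_moment(values, probs, k):
    """k-th central moment of a finitely supported distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** k for v, p in zip(values, probs))

def satisfies_moment_bound(values, probs, alpha, k_max=20):
    """Check |m_k| / m_2^(k/2) <= (1/2) k! alpha^(k-2) for 3 <= k <= k_max."""
    m2 = central_moment(values, probs, 2)
    if m2 == 0:
        return True  # the definition treats a zero denominator as satisfied
    for k in range(3, k_max + 1):
        ratio = abs(central_moment(values, probs, k)) / m2 ** (k / 2)
        if ratio > 0.5 * math.factorial(k) * alpha ** (k - 2):
            return False
    return True

# Hypothetical example: a symmetric three-point distribution on {-1, 0, 1}.
values, probs = [-1.0, 0.0, 1.0], [0.25, 0.5, 0.25]
holds_with_alpha_one = satisfies_moment_bound(values, probs, alpha=1.0)
holds_with_tiny_alpha = satisfies_moment_bound(values, probs, alpha=0.1)
```

Here the fourth standardized moment equals 2, so the check passes for $\alpha = 1$ (the bound at $k = 4$ is 12) and fails for $\alpha = 0.1$ (the bound at $k = 4$ is 0.12).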



1 Recall that a real valued function is analytic on some domain of $\mathbb{R}^p$ if the derivatives
of all orders exist, and if for each interior point, the Taylor series converges in some sufficiently
small neighborhood of that point.

This condition is rather mild in that the standardized moments may increase as fast as
$k!\,\alpha^{k-2}$ (in a sense $\alpha$ is just a unitless scale, and it is predominantly the $k!$
which makes the condition rather mild). This condition is closely related to those used in obtaining
sharp exponential type tail bounds for the convergence of a random variable to its mean. In
particular, the Bernstein conditions (Bernstein [1946]) are almost identical to the above, except
that they use the $k$-th raw moments (not central moments); see footnote 2. In fact, these moment
conditions are weaker than requiring "sub-Gaussian" tails.

While we would not expect analytic moments to be finite for all distributions (e.g. heavy tailed 
ones), we will see that exponential families have (finite) analytic standardized moments. 



Cumulants: Recall that the cumulant-generating function $f$ of $z$ under $\rho$ is the log of the
moment-generating function, if it exists, i.e. $f(s) = \log E[e^{sz}]$. The $k$-th cumulant is given
by the $k$-th derivative of $f$ at 0, i.e. $c_{k,\rho}(z) = f^{(k)}(0)$. The first cumulant is the
mean, and the second and third cumulants are just the second and third central moments; higher
cumulants are neither moments nor central moments, but rather more complicated polynomial functions
of the moments (though these relationships are known). Analogously, the $k$-th standardized cumulant
is $\frac{c_{k,\rho}(z)}{c_{2,\rho}(z)^{k/2}}$; this normalization with respect to standard deviation
(the second cumulant is the variance) makes these unitless quantities.
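The known relationship between moments and cumulants mentioned above can be made concrete with the standard recursion $c_n = m'_n - \sum_{j=1}^{n-1} \binom{n-1}{j-1} c_j\, m'_{n-j}$, where $m'_n = E[z^n]$ are raw moments. The sketch below is a minimal illustration; the function name is an assumption, and the Poisson raw moments used as input are supplied for the example rather than taken from the paper.

```python
from math import comb

def cumulants_from_raw_moments(raw):
    """Convert raw moments raw[n-1] = E[z^n] into cumulants [c_1, ..., c_N].

    Uses c_n = m_n - sum_{j=1}^{n-1} C(n-1, j-1) * c_j * m_{n-j}.
    """
    cumulants = []
    for n in range(1, len(raw) + 1):
        correction = sum(comb(n - 1, j - 1) * cumulants[j - 1] * raw[n - j - 1]
                         for j in range(1, n))
        cumulants.append(raw[n - 1] - correction)
    return cumulants

# Raw moments of a Poisson variable with rate 1 (the Bell numbers 1, 2, 5, 15, 52).
poisson_cumulants = cumulants_from_raw_moments([1, 2, 5, 15, 52])
# Raw moments of a standard Gaussian: 0, 1, 0, 3.
gaussian_cumulants = cumulants_from_raw_moments([0, 1, 0, 3])
```

Every cumulant of the rate-1 Poisson equals 1, while the Gaussian cumulants beyond the second vanish, which is the behavior the unit variance Gaussian example relies on.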

Cumulants are viewed as equally fundamental as central moments, and we make use of their be- 
havior as well — in certain settings, it is more natural to work with the cumulants. We define the 
analytic standardized cumulant analogous to before: 

Definition 3.2. Let $z$ be a univariate random variable under $\rho$. Then $z$ has an analytic
standardized cumulant of $\alpha$ if the standardized cumulants exist and are bounded as follows:

$$\forall k \ge 3, \quad \left| \frac{c_{k,\rho}(z)}{c_{2,\rho}(z)^{k/2}} \right| \le \frac{1}{2}\, k!\, \alpha^{k-2}$$

(where the above is assumed to hold if the denominator is 0). If $t \in \mathbb{R}^p$ is a
multivariate random variable distributed according to $\rho$, we say that $t$ has an analytic
standardized cumulant of $\alpha$ with respect to a subspace $V \subset \mathbb{R}^p$ if the above
bound holds for all univariate $z = \langle v, t \rangle$ where $v \in V$.



Existence: The following lemma shows that exponential families have (finite) analytic standard- 
ized moments and cumulants, as a consequence of the analyticity of the moment and cumulant 
generating functions (the proof is in the appendix). 

Lemma 3.3. If $t$ is the sufficient statistic of an exponential family with parameter $\theta$, where
$\theta$ is an interior point of the natural parameter space, then $t$ has both a finite analytic
standardized moment and a finite analytic standardized cumulant, with respect to all directions in
$\mathbb{R}^p$.



3.2 Examples 

Let us consider a few examples. Going through them, there are two issues to bear in mind. First,
$\alpha$ is quantified only at a particular $\theta$ (later, $\theta^*$ is the point we will be
interested in); note that we do not require any uniform conditions on any derivatives over all
$\theta$. Second, we are interested in how $\alpha$ could depend on the dimensionality: in some cases,
$\alpha$ is dimension free and in other cases (like for generalized linear models), $\alpha$ depends
on the dimension through spectral properties of $\mathcal{F}^*$ (and this dimension dependence can be
relaxed in the sparse case that we consider, as discussed later).



3.2.1 One Dimensional Families 

When $\theta$ is a scalar, there is no direction $v$ to consider.



2 The Bernstein inequalities used in deriving tail bounds require that, for all $k \ge 2$,
$\frac{E[z^k]}{E[z^2]} \le \frac{1}{2}\, k!\, L^{k-2}$ for some constant $L$ (which has units of $z$).

Bernoulli distributions In the canonical form, the Bernoulli distribution is

$$P(y|\theta) = \exp\left(y\theta - \log(1 + e^{\theta})\right)$$

with $\theta \in \mathbb{R} = \Theta$. We have $m_1(\theta^*) = e^{\theta^*}/(1 + e^{\theta^*})$. The
central moments satisfy $m_2(\theta^*) = m_1(\theta^*)(1 - m_1(\theta^*))$ and
$m_k(\theta^*) \le m_2(\theta^*)$ for $k \ge 3$. Thus, $\alpha = 1/\sqrt{m_2(\theta^*)}$ is a
standardized analytic moment at any $\theta^* \in \Theta$. Further,
$c_k(\theta^*) \le c_2(\theta^*) = m_2(\theta^*)$ for $k \ge 3$. Thus, $\alpha$ is also a
standardized analytic cumulant at any $\theta^* \in \Theta$.
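The moment claims in the Bernoulli example can be spot-checked numerically. The sketch below is an illustration only: it uses the closed form $m_k = q(1-q)^k + (1-q)(-q)^k$ for the central moments of a Bernoulli variable with mean $q$, checks the moment part of the example in absolute value, and does so only up to a finite order `k_max`. The function names are assumptions.

```python
import math

def bernoulli_central_moment(theta, k):
    """k-th central moment of the canonical Bernoulli family at parameter theta."""
    q = math.exp(theta) / (1.0 + math.exp(theta))  # the mean m_1
    return q * (1 - q) ** k + (1 - q) * (-q) ** k

def bernoulli_bound_holds(theta, k_max=30):
    """Check |m_k| <= m_2 and the Definition 3.1 bound with alpha = 1/sqrt(m_2)."""
    m2 = bernoulli_central_moment(theta, 2)
    alpha = 1.0 / math.sqrt(m2)
    for k in range(3, k_max + 1):
        mk = abs(bernoulli_central_moment(theta, k))
        if mk > m2 * (1 + 1e-12):  # small slack for floating point rounding
            return False
        if mk / m2 ** (k / 2) > 0.5 * math.factorial(k) * alpha ** (k - 2):
            return False
    return True

results = [bernoulli_bound_holds(th) for th in (-3.0, -1.0, 0.0, 0.5, 2.0, 4.0)]
```

The check passes at every parameter value tried, including skewed ones where $m_2$ is small and $\alpha$ is correspondingly large.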

Unit variance Gaussian distributions In the canonical form, the unit variance Gaussian is

$$P(y|\theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) \exp\left(y\theta - \frac{\theta^2}{2}\right)$$

with $\theta \in \mathbb{R} = \Theta$. We have $m_1(\theta^*) = \theta^*$ and $m_2(\theta^*) = 1$. Odd
central moments are 0 and, for even $k \ge 4$, we have
$m_k(\theta^*) = \frac{k!}{2^{k/2}(k/2)!}$. Thus, $\alpha = 1$ is a standardized analytic moment at
any $\theta^* \in \Theta$. However, the log-likelihood is already quadratic in this case (as we shall
see, there should be no "burn in" phase until it begins to look like a quadratic!). This becomes
evident if we consider the cumulants instead. All cumulants $c_k(\theta^*) = 0$ for $k \ge 3$ and
hence $\alpha = 0$ is a standardized analytic cumulant at any $\theta^* \in \Theta$. Curiously, a
cumulant generating function cannot be a finite order polynomial of order greater than 2.

3.2.2 Multidimensional Gaussian Covariance Estimation (i.e. "Gaussian Graphs") 

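Consider a mean zero $p$-dimensional multivariate Normal parameterized by the precision matrix
$\Theta$,

$$P(Y|\Theta) = \frac{1}{(2\pi)^{p/2}} \exp\left(-\frac{1}{2}\langle \Theta, YY^\top \rangle + \frac{1}{2}\log\det(\Theta)\right).$$

A "direction" here is a positive semi-definite (p.s.d.) matrix $V$, and we seek the cumulants of the
random variable $\langle V, YY^\top \rangle$.

Note that $YY^\top$ has Wishart distribution $W_p(\Theta^{-1}, 1)$ with the moment generating
function

$$V \mapsto E[\exp(\langle V, YY^\top \rangle)] = \det\left(I - 2V\Theta^{-1}\right)^{-1/2}.$$

Let the $\lambda_i$'s be the eigenvalues of $V\Theta^{-1}$. Then, taking logs, the cumulant
generating function $f(s)$ is

$$f(s) = \log E[\exp(s\langle V, YY^\top \rangle)] = \log \prod_{i=1}^{p} (1 - 2s\lambda_i)^{-1/2} = -\frac{1}{2}\sum_{i=1}^{p} \log(1 - 2s\lambda_i).$$

The $k$-th derivative of this is

$$f^{(k)}(s) = 2^{k-1}(k-1)! \sum_{i=1}^{p} \frac{\lambda_i^k}{(1 - 2s\lambda_i)^k}.$$

Thus, the cumulant $c_{k,\Theta}(V) = f^{(k)}(0) = 2^{k-1}(k-1)! \sum_i \lambda_i^k$. Hence, for
$k \ge 3$,

$$\frac{c_{k,\Theta}(V)}{c_{2,\Theta}(V)^{k/2}} = \frac{2^{k-1}(k-1)! \sum_i \lambda_i^k}{\left(2\sum_i \lambda_i^2\right)^{k/2}} \le 2^{k/2-1}(k-1)! \le \frac{1}{2}\, k!\, (\sqrt{2})^{k-2}.$$

Thus, $\alpha = \sqrt{2}$ is a standardized analytic cumulant at $\Theta$. Note that it is harder to
estimate the central moments in this case. This example is also interesting in connection to the
analysis of Newton's method as the function $\log\det(\Theta)$ is self-concordant on the cone of
p.s.d. matrices.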
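The cumulant formula for this example, $c_k = 2^{k-1}(k-1)! \sum_i \lambda_i^k$, makes the claim $\alpha = \sqrt{2}$ easy to spot-check once the eigenvalues $\lambda_i$ of $V\Theta^{-1}$ are known. The sketch below takes those eigenvalues as given (computing them is outside its scope), uses hypothetical function names, and checks only finitely many orders.

```python
import math

def wishart_direction_cumulant(eigs, k):
    """c_k = 2^(k-1) (k-1)! sum_i lambda_i^k for eigenvalues of V Theta^{-1}."""
    return 2 ** (k - 1) * math.factorial(k - 1) * sum(lam ** k for lam in eigs)

def sqrt2_bound_holds(eigs, k_max=25):
    """Check c_k / c_2^(k/2) <= (1/2) k! sqrt(2)^(k-2) for 3 <= k <= k_max."""
    alpha = math.sqrt(2.0)
    c2 = wishart_direction_cumulant(eigs, 2)
    for k in range(3, k_max + 1):
        ratio = abs(wishart_direction_cumulant(eigs, k)) / c2 ** (k / 2)
        bound = 0.5 * math.factorial(k) * alpha ** (k - 2)
        if ratio > bound * (1 + 1e-12):  # small slack for floating point rounding
            return False
    return True

# Hypothetical nonnegative eigenvalues (V is p.s.d. and Theta is positive definite).
spread_ok = sqrt2_bound_holds([0.5, 1.0, 3.0])
single_ok = sqrt2_bound_holds([1.0])
```

With a single eigenvalue equal to 1 the variable is a squared standard Gaussian, and the formula returns the familiar mean 1 and variance 2.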

3.2.3 Generalized Linear Models 

Consider the case where we have some covariate, response pair $(X, Y)$ drawn from some distribution
$D$. Suppose that we have a family of distributions $P(\cdot|\theta; X)$ such that, for each $X$, it
is an exponential family with natural sufficient statistic $t_{y,X}$,

$$P(y|\theta; X) = h_y \exp\left(\langle \theta, t_{y,X} \rangle - \log Z_X(\theta)\right),$$

where $\theta \in \Theta$. The loss we consider is
$\mathcal{L}(\theta) = E_{X,Y \sim D}[-\log P(Y|\theta; X)]$. A special case of this setup is as
follows. Say we have a one dimensional exponential family

$$q_\nu(y) = h_y \exp(y\nu - \log Z(\nu)),$$

where $y, \nu \in \mathbb{R}$. The family $P(\cdot|\theta; X)$ can simply be
$q_{\langle \theta, X \rangle}$ (i.e. taking $\nu = \langle \theta, X \rangle$). Thus,

$$P(y|\theta; X) = h_y \exp\left(y\langle \theta, X \rangle - \log Z(\langle \theta, X \rangle)\right).$$

We see that $t_{y,X} = yX$ and $Z_X(\theta) = Z(\langle \theta, X \rangle)$. For example, when
$q_\nu$ is either the Bernoulli family or the unit variance Gaussian family, this corresponds to
logistic regression or least squares regression, respectively. It is easy to see that the analogue of
having a standardized analytic moment of $\alpha$ at $\theta$ w.r.t. a direction $v$ is to have

$$\frac{m_{k,\theta}(v)}{m_{2,\theta}(v)^{k/2}} \le \frac{1}{2}\, k!\, \alpha^{k-2}$$

where

$$m_{k,\theta}(v) = E_X\left[m_{k,P(\cdot|\theta;X)}(\langle t_{y,X}, v \rangle)\right].$$

In the above equation, the expectation is under $X \sim D_X$, the marginal of $D$ on $X$. If the
sufficient statistic $t_{y,X}$ is bounded by $B$ in the $L_2$ norm a.s. and the expected Fisher
information matrix

$$E_X\left[E_{y \sim P(\cdot|\theta;X)}\left[-\nabla^2 \log P(y|\theta; X)\right]\right]$$

has minimum eigenvalue $\lambda_{\min}$, then we can choose $\alpha = B/\lambda_{\min}$. Note that
$\lambda_{\min}$ could be small but it arose only because we are considering an arbitrary direction
$v$. If the set of directions $V$ is smaller, then we can often get less pessimistic bounds. For
example, see section 5.2.2 in the appendix. We also note that similar bounds can be derived when we
assume sub-Gaussian tails for $t_{y,X}$ rather than assuming it is bounded a.s.

3.3 Almost Strong Convexity 

Recall that a strictly convex function $F$ is strongly convex if the Hessian of $F$ has a (uniformly)
lower bounded eigenvalue (see Boyd and Vandenberghe [2004]). Unfortunately, as for all strictly
convex functions, exponential families only behave in a strongly convex manner in a (sufficiently
small) neighborhood of $\theta^*$. Our first main result quantifies when this behavior is exhibited.

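Theorem 3.4. (Almost Strong Convexity) Let $\alpha$ be either the analytic standardized moment or
cumulant under $\theta^*$ with respect to a subspace $V$. For any $\theta$ such that
$\theta - \theta^* \in V$, if either

$$\mathcal{L}(\theta) - \mathcal{L}(\theta^*) \le \frac{1}{65\alpha^2} \quad \text{or} \quad \|\theta - \theta^*\|^2_{\mathcal{F}^*} \le \frac{1}{16\alpha^2}$$

then

$$\frac{1}{4}\|\theta - \theta^*\|^2_{\mathcal{F}^*} \le \mathcal{L}(\theta) - \mathcal{L}(\theta^*) \le \frac{3}{4}\|\theta - \theta^*\|^2_{\mathcal{F}^*}.$$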

Suppose $\theta$ is an MLE. Both preconditions can be thought of as a "burn in" phase: the idea being
that initially a certain number of samples is needed until the loss of $\theta$ is somewhat close to
the minimal loss, after which point the quadratic lower bound engages. This is analogous to the
analysis of Newton's method, which quantifies the number of steps needed to enter the quadratically
convergent phase (see Boyd and Vandenberghe [2004]). The constants of 1/4 and 3/4 can be made
arbitrarily close to 1/2 (with a longer "burn in" phase), as expected under the central limit theorem.
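The sandwich in Theorem 3.4 can be illustrated on the one dimensional Bernoulli family. The sketch below makes a simplifying assumption that the paper does not require: the data are generated by the model at $\theta^*$, so that $\mathcal{L}(\theta) - \mathcal{L}(\theta^*) = \log Z(\theta) - \log Z(\theta^*) - (\theta - \theta^*)\, m_1(\theta^*)$. It uses $\alpha = 1/\sqrt{m_2(\theta^*)}$ from the Bernoulli example; all function names are hypothetical.

```python
import math

def bernoulli_mean(theta):
    """Mean e^theta / (1 + e^theta) of the canonical Bernoulli family."""
    return math.exp(theta) / (1.0 + math.exp(theta))

def regret(theta, theta_star):
    """L(theta) - L(theta*) when the data follow P(.|theta*) (assumed here)."""
    log_z = lambda x: math.log1p(math.exp(x))
    return (log_z(theta) - log_z(theta_star)
            - (theta - theta_star) * bernoulli_mean(theta_star))

def fisher_risk(theta, theta_star):
    """(theta - theta*)^2 times the Fisher information m_2(theta*)."""
    q = bernoulli_mean(theta_star)
    return (theta - theta_star) ** 2 * q * (1.0 - q)

def sandwich_holds(theta, theta_star):
    """Check (1/4) risk <= regret <= (3/4) risk inside the burn-in region."""
    q = bernoulli_mean(theta_star)
    alpha_sq = 1.0 / (q * (1.0 - q))  # alpha = 1 / sqrt(m_2)
    risk = fisher_risk(theta, theta_star)
    if risk > 1.0 / (16.0 * alpha_sq):
        raise ValueError("outside the burn-in region of Theorem 3.4")
    gap = regret(theta, theta_star)
    return 0.25 * risk <= gap <= 0.75 * risk

checks = [sandwich_holds(ts + d, ts)
          for ts in (-2.0, 0.0, 1.0, 3.0)
          for d in (-0.2, -0.05, 0.05, 0.2)]
```

For this family the precondition on the Fisher risk reduces to $|\theta - \theta^*| \le 1/4$, and within that range the regret stays close to half the Fisher risk, as the central limit theorem heuristic suggests.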

A key idea in the proof is an expansion of the prediction regret in terms of the central moments. We
use the shorthand notation of $c_{k,\theta}(\Delta)$ and $m_{k,\theta}(\Delta)$ to denote the
cumulants and moments of the random variable $\langle \Delta, t \rangle$ under the distribution
$P(\cdot|\theta)$.

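Lemma 3.5. (Moment and Cumulant Expansion) Define $\Delta = \theta - \theta^*$. For all
$s \in [0, 1]$,

$$\mathcal{L}(\theta^* + s\Delta) - \mathcal{L}(\theta^*) = \sum_{k=2}^{\infty} \frac{c_{k,\theta^*}(\Delta)}{k!}\, s^k$$

$$\mathcal{L}(\theta^* + s\Delta) - \mathcal{L}(\theta^*) = \log\left(1 + \sum_{k=2}^{\infty} \frac{m_{k,\theta^*}(\Delta)}{k!}\, s^k\right)$$

where the equalities hold if the right hand sides converge.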
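The cumulant expansion can be verified in closed form for the Poisson family, which in natural form has $\log Z(\theta) = e^{\theta}$, so every cumulant of $\Delta \cdot t$ under $\theta^*$ equals $\Delta^k e^{\theta^*}$. The sketch below is illustrative only: the choice of the Poisson family, the assumption that the data follow the model at $\theta^*$, and the function names are not from the paper.

```python
import math

def poisson_regret(theta_star, delta, s):
    """L(theta* + s*delta) - L(theta*) for Poisson data generated at theta*."""
    rate = math.exp(theta_star)
    return math.exp(theta_star + s * delta) - rate - s * delta * rate

def cumulant_series(theta_star, delta, s, terms=40):
    """Truncated sum over k >= 2 of c_k(delta) s^k / k!, with c_k = delta^k e^{theta*}."""
    rate = math.exp(theta_star)
    return sum(rate * delta ** k * s ** k / math.factorial(k)
               for k in range(2, terms))

closed_form = poisson_regret(0.3, 0.8, 0.5)
series_value = cumulant_series(0.3, 0.8, 0.5)
```

The two quantities agree to floating point precision because the series is just the Taylor expansion of $e^{\theta^*}(e^{s\Delta} - 1 - s\Delta)$.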

The proof of this Lemma (in the appendix) is relatively straightforward. The key technical step in
the proof of Theorem 3.4 is characterizing when these expansions converge. Note that for
$\Delta = \theta - \theta^*$, even if $\|\Delta\|^2_{\mathcal{F}^*} \le \frac{1}{16\alpha^2}$ (one of
our preconditions), a direct attempt at lower bounding
$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*)$ using the above expansions with the analytic
moment condition would not imply these expansions converge; the proof requires a more delicate
argument.
4 Sparsity

We now consider the case where $\theta^*$ is sparse, with support $S$ and sparsity level $s$, i.e.

$$S = \{i : [\theta^*]_i \ne 0\}, \qquad s = |S|.$$

In order to understand when $L_1$ regularized algorithms (for linear regression) converge at a rate
comparable to that of $L_0$ algorithms (subset selection), Meinshausen and Yu [2009] considered a
sparse eigenvalue condition on the design matrix, where the eigenvalues on any small (sparse) subset
are bounded away from 0. Bickel et al. [2008] relaxed this condition so that vectors whose support
is "mostly" on any small subset are not too small (see Bickel et al. [2008] for a discussion). We also
consider this relaxed condition, but now on the Fisher matrix.

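Assumption 4.1. (Restricted Fisher Eigenvalues) For a vector $\delta$, let $\delta_S$ be the vector
such that $\forall i \in S$, $[\delta_S]_i = \delta_i$ and $\delta_S$ is 0 on the other coordinates,
and let $S^C$ denote the complement of $S$. Assume that:

$$\forall \delta \text{ s.t. } \|\delta_{S^C}\|_1 \le 3\|\delta_S\|_1, \quad \|\delta\|_{\mathcal{F}^*} \ge \kappa^*_{\min}\|\delta_S\|_2$$

$$\forall \delta \text{ s.t. } \delta_{S^C} = 0, \quad \|\delta\|_{\mathcal{F}^*} \le \kappa^*_{\max}\|\delta_S\|_2$$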

The constant of 3 is for convenience. Note we only quantify on the support $S$, a substantially
weaker condition than in Meinshausen and Yu [2009] and Bickel et al. [2008], which quantify over all
subsets (in fact, many previous algorithms/analyses actually use this condition on subsets different
from $S$, e.g. Meinshausen and Yu [2009], Candes and Tao [2007], Zhang [2008]).

Furthermore, with regards to our analyticity conditions, our proof shows that the subspace of
directions we need to consider is now restricted to the set:

$$V = \{v : \|v_{S^C}\|_1 \le 3\|v_S\|_1\} \qquad (1)$$

Under this Restricted Eigenvalue (RE) condition, we can replace the minimal eigenvalue used in
Example 3.2.3 by $\kappa^*_{\min}$ (section 5.2.2 in appendix), which could be significantly smaller.
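To make the restricted set of directions concrete, the sketch below tests membership in the cone $\|v_{S^C}\|_1 \le 3\|v_S\|_1$ and evaluates the Fisher norm $\|\delta\|_{\mathcal{F}^*} = \sqrt{\delta^\top \mathcal{F}^* \delta}$ for a given matrix. The function names and the example vectors are assumptions for illustration; the sketch does not verify the restricted eigenvalue assumption itself, which quantifies over all vectors in the cone.

```python
import math

def in_restricted_cone(v, support, factor=3.0):
    """Return True if ||v_{S^C}||_1 <= factor * ||v_S||_1 for the index set S."""
    support = set(support)
    on_support = sum(abs(x) for i, x in enumerate(v) if i in support)
    off_support = sum(abs(x) for i, x in enumerate(v) if i not in support)
    return off_support <= factor * on_support

def fisher_norm(delta, fisher):
    """sqrt(delta^T F delta) for a matrix F given as a list of rows."""
    n = len(delta)
    quad = sum(delta[i] * fisher[i][j] * delta[j]
               for i in range(n) for j in range(n))
    return math.sqrt(quad)

mostly_on_support = in_restricted_cone([1.0, 0.5, -2.0, 0.2], support=[0, 2])
mostly_off_support = in_restricted_cone([0.1, 1.0], support=[0])
```

A vector whose mass lies mostly on $S$ belongs to the cone, while one dominated by off-support coordinates does not, which is why the analyticity conditions only need to hold for a restricted set of directions.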



4.1 Fisher Risk 



Consider the following regularized optimization problem:

$$\hat\theta = \mathrm{argmin}_{\theta \in \Theta}\; \hat{E}[-\log P(y|\theta)] + \lambda\|\theta\|_1 \qquad (2)$$

where the empirical expectation is with respect to a sample. This reduces to the usual linear
regression example (for Gaussian means) and involves the log-determinant in the Gaussian graph
setting (considered in Ravikumar et al. [2008b]) where $\theta$ is the precision matrix (see Example
3.2.2).
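For the linear regression special case, problem (2) becomes an $L_1$-penalized least squares problem, which can be solved by many standard methods. The sketch below uses proximal gradient descent with soft-thresholding (ISTA) on the objective $\frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$. This solver is a swapped-in illustration, not a method the paper prescribes; the function names, step size, and toy data are assumptions.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def ista_lasso(X, y, lam, step, iters=200):
    """Minimize (1/(2n)) * ||y - X theta||^2 + lam * ||theta||_1 by ISTA.

    `step` should be at most 1/L, where L is the largest eigenvalue of X^T X / n.
    """
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        residual = [sum(X[i][j] * theta[j] for j in range(p)) - y[i]
                    for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n
                for j in range(p)]
        theta = [soft_threshold(theta[j] - step * grad[j], step * lam)
                 for j in range(p)]
    return theta

# Toy orthogonal design: each coordinate is solved by one soft-threshold step.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
theta_hat = ista_lasso(X, y, lam=0.5, step=2.0)
```

On this toy design the smooth part has curvature $1/2$, so the step size 2 is exact and the solution is the soft-thresholded response: the large coordinate shrinks from 3 to 2 and the small one is set to zero.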

Our next main result provides a risk bound, under the RE condition. Typically, the regularization
parameter $\lambda$ is specified as a function of the noise level, under a particular noise model
(e.g. for the linear regression case, where $Y = \beta X + \eta$ with the noise model
$\eta \sim \mathcal{N}(0, \sigma^2)$, $\lambda$ is specified as $\sigma\sqrt{\frac{\log p}{n}}$; see
Meinshausen and Yu [2009], Bickel et al. [2008]). Here, our theorem is stated in a deterministic
manner (i.e. it is a distribution free statement), to explicitly show that an appropriate value of
$\lambda$ is determined by the $L_\infty$ norm of the measurement error, i.e.
$\|\hat{E}[t] - E[t]\|_\infty$; we then easily quantify $\lambda$ in a corollary under a mild
distributional assumption. Also, we must have that this measurement error be (quantifiably)
sufficiently small such that our "burn in" condition holds.
Theorem 4.2. (Risk) Suppose that Assumption 4.1 holds and $\lambda$ satisfies both



\\E\t] -EMIU < - and A< *„„ „ (3) 

where $\alpha^*$ is the analytic standardized moment or cumulant of $\theta^*$ for the subspace $V$
defined in (1). (Note this setting requires that $\|\hat{E}[t] - E[t]\|_\infty$ be sufficiently
small.) Then if $\hat\theta$ is the solution to the optimization problem in (2), the Fisher risk is
bounded as follows

\\\9~9*\\^<C{9)-C{9*)<^ 

4 K* 
and the L\ risk is bounded as follows: 

l«-«*l.<^ 

^min 



nun 
Intuitively, we expect the measurement error $\|\hat{E}[t] - E[t]\|_\infty$ to be
$O(\sigma\sqrt{\frac{\log p}{n}})$, so we think of $\lambda$ as $O(\sigma\sqrt{\frac{\log p}{n}})$.
Note this would recover the usual (optimal) risk bound of $O(\sigma^2 \frac{s \log p}{n})$ (i.e. the
same rate as an $L_0$ algorithm, up to the RE constant). Note that the mild dimension dependence
enters through the measurement error. Hence, our theorem shows that all exponential families exhibit
favorable convergence rates under the RE condition.

The following proposition and corollary quantify this under a mild (and standard) distributional 
assumption (which can actually be relaxed somewhat). 

Proposition 4.3. If $t$ is sub-Gaussian, i.e. there exists $\sigma \ge 0$ such that $\forall i$ and
$\forall s \in \mathbb{R}$, $E[e^{s(t_i - E t_i)}] \le e^{\sigma^2 s^2/2}$, then for any $\delta > 0$,
with probability at least $1 - \delta$,



l E M - E WII 



'log ? 



Bounded random variables are in fact sub-Gaussian (though unbounded t may also be sub-Gaussian, 
e.g. Gaussian random variables are obviously sub-Gaussian). The following corollary is immediate. 

Corollary 4.4. Suppose that Assumption 4.1 and the sub-Gaussian condition in Proposition 4.3 hold.

For any 6 > 0, as long as n > ifa* 4 ||0*||?cr 2 log(f), (where K is a universal constant), setting 

A = 2<t y log (~ s \ we have with probability at least 1 — S, 



\K ■ J n k . V n 

^ min / mm 

4.2 Approximate Model Selection 

An important issue unaddressed by the previous result is the sparsity level of our estimate
$\hat\theta$. For the linear regression case, Meinshausen and Yu [2009] and Bickel et al. [2008] show
that the $L_1$ solution is actually sparse, with a sparsity level of roughly
$O((\frac{\kappa^*_{\max}}{\kappa^*_{\min}})^2 s)$ (i.e. the sparsity level increases by a factor
which is essentially a condition number squared). In the general setting, we do not have a
characterization of the actual sparsity level of the $L_1$ solution.

However, we now present a two stage procedure, which provides an estimate with support on merely
$2s$ features, with nearly as good risk (Shalev-Shwartz et al. [2009] discuss this issue of trading
sparsity for accuracy, but their results are more applicable to settings with $O(\frac{1}{\sqrt{n}})$
rates). Consider the procedure where we select the set of coordinates which have large weight under
$\hat\theta$ (say greater than some threshold $\tau$). Then we refit to find an estimate with support
only on these coordinates. That is, we restrict our estimate to the set
$\Theta_\tau = \{\theta \in \Theta : \theta_i = 0 \text{ if } |\hat\theta_i| \le \tau\}$. This
algorithm is:

$$\tilde\theta = \mathrm{argmin}_{\theta \in \Theta_\tau}\; \hat{\mathcal{L}}(\theta) + \lambda\|\theta\|_1 \qquad (4)$$
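The selection stage of this two stage procedure is simple to state in code. The sketch below keeps the coordinates of a first-stage estimate whose magnitude exceeds the threshold and projects a parameter vector onto that support; the refitting stage (4) would then rerun an $L_1$-regularized solver over the retained coordinates only and is not implemented here. The function names and the example estimate are hypothetical.

```python
def thresholded_support(theta_hat, tau):
    """Indices i with |theta_hat_i| > tau, i.e. the coordinates kept for refitting."""
    return [i for i, w in enumerate(theta_hat) if abs(w) > tau]

def restrict_to_support(theta, support):
    """Zero out every coordinate outside the selected support."""
    keep = set(support)
    return [w if i in keep else 0.0 for i, w in enumerate(theta)]

# Hypothetical first-stage estimate with two large and two negligible weights.
theta_hat = [2.0, 0.05, -1.5, 0.0]
support = thresholded_support(theta_hat, tau=0.1)
restricted = restrict_to_support(theta_hat, support)
```

Only the coordinates with large first-stage weight survive, so the refit is a lower dimensional problem constrained to those coordinates.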
Theorem 4.5. (Sparsity) Suppose that Assumption 4.1 holds and the regularization parameter $\lambda$ satisfies both

Mi] EMIU < £ and A < mini I , — K *™\ } (5) 
2 270a* || fl* ||i 340K max a*v s 

where $\alpha^*$ is the analytic standardized moment or cumulant of $\theta^*$ for the subspace $V$ defined in (1).
If 9 is the solution of (O with this A and 9 is the solution of (|4]i with threshold r = ^l 8 ^ and this 
A, then: 

1. $\tilde\theta$ has support on at most $2s$ coordinates.

2. The Fisher risk is bounded as follows: 
1 - - / k* \ 2 Q sA 2 



""min 



8 



Using Proposition 4.3 we have the following corollary.

Corollary 4.6. Suppose that Assumption 4.1 and the sub-Gaussian condition in Proposition 4.3
hold. Then for any S > 0, as long as n > Ka* 2 a 2 log (^) max | s ^" a \ , <a* 2 ||0*||i| (where K is a 

universal constant), setting X = 2 y „ anrf threshold r = 36a/ » ^ , we /lave f/zaf w/f/i 

probability at least 1 — S, 

V Cm/ \< lin 2 / n 
5 Appendix



5.1 Proofs for Section 3

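Proof. (of Lemma 3.3) The proof shows that the central moment generating function of
$z = \langle v, t \rangle$, namely $E[\exp(s(\langle v, t \rangle - E[\langle v, t \rangle]))]$, is
analytic at $\theta$. First, notice that

$$\begin{aligned}
E[\exp(s(\langle v, t \rangle - E[\langle v, t \rangle]))]
&= \exp(-sE[\langle v, t \rangle]) \int_t h_t \exp(s\langle v, t \rangle) \exp\{\langle \theta, t \rangle - \log Z(\theta)\}\, dt \\
&= \exp(-sE[\langle v, t \rangle])\, \frac{\int_t h_t \exp\{\langle \theta + sv, t \rangle\}\, dt}{\int_t h_t \exp\{\langle \theta, t \rangle\}\, dt} \\
&= \exp(-sE[\langle v, t \rangle])\, \frac{Z(\theta + sv)}{Z(\theta)}.
\end{aligned}$$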

It is known that for exponential families, $Z(\theta)$ (namely, the partition function) is analytic
in the interior of $\Theta$ (see Brown [1986]). Since $\exp(-sE[\langle v, t \rangle])$ is also
analytic (as a function of $s$), we have by the chain of equalities above that the central moment
generating function is also analytic (as a function of $s$) for any $\theta$ in the interior of
$\Theta$. This property implies that the derivatives of the central moment generating function at
$s = 0$ (namely, the moments $m_{k,\rho}(z)$) cannot grow too fast with $k$. In particular, by
proposition 2.2.10 in Krantz and Parks [2002], it holds for all $k$ that the $k$-th derivative (which
is equal to $m_{k,\rho}(z)$) is at most $k!\,B^k$ for some constant $B$. As a result,
$|m_{k,\rho}(z)/m_{2,\rho}(z)^{k/2}|$ is at most $\frac{1}{2}\, k!\, \alpha^{k-2}$ for a suitable
constant $\alpha$. Thus, $t$ has finite analytic standardized moment with respect to all directions.

As to the assertion about $t$ having finite analytic standardized cumulant, notice that our argument
above also implies that the (raw) moment generating function, $E[\exp(s\langle v, t \rangle)]$, is
analytic. Therefore, $\log(E[\exp(s\langle v, t \rangle)])$, which is the cumulant generating
function, is also analytic (since the logarithm is an analytic function). An analysis completely
identical to the above leads to the desired conclusion about the cumulants of $t$. □

From here on, we slightly abuse notation and let $m_k(\Delta)$ be the $k$-th central moment of the
univariate random variable $\langle \Delta, t \rangle$ distributed under $\theta^*$.

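Proof. (of Lemma 3.5) First, note that since $\theta^*$ is optimal, we have
$E_{t \sim D}[t] = E_{t \sim P(\cdot|\theta^*)}[t]$. Hence,

$$\begin{aligned}
\mathcal{L}(\theta^* + s\Delta) - \mathcal{L}(\theta^*)
&= -s\langle \Delta, E_{t \sim P(\cdot|\theta^*)}[t] \rangle + \log\frac{Z(\theta^* + s\Delta)}{Z(\theta^*)} \\
&= -s\, m_1(\Delta) + \log\frac{Z(\theta^* + s\Delta)}{Z(\theta^*)} \\
&= \log\left(\frac{e^{-s\, m_1(\Delta)}\, Z(\theta^* + s\Delta)}{Z(\theta^*)}\right).
\end{aligned}$$

In the proof of Lemma 3.3 it was shown that
$e^{-s\, m_1(\Delta)} \frac{Z(\theta^* + s\Delta)}{Z(\theta^*)}$ is the central moment generating
function, that it is analytic, and that the expression above is analytic as well. Their Taylor
expansions complete the proof. □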

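The following upper and lower bounds are useful in that they guarantee the sum converges for the
choice of $s$ specified.

Lemma 5.1. Let $\alpha$ and $\theta$ be defined as in Theorem 3.4. Let $\Delta = \theta - \theta^*$
and set $s = \min\left\{\frac{1}{4\alpha\sqrt{m_2(\Delta)}}, 1\right\}$. If $\alpha$ is an analytic
moment, then

$$\frac{1}{3}\, \frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta), 1\}} \le \sum_{k=2}^{\infty} \frac{m_k(\Delta) s^k}{k!} \le \frac{2}{3}\, \frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta), 1\}}.$$

If $\alpha$ is an analytic cumulant, then

$$\frac{1}{3}\, \frac{c_2(\Delta)}{\max\{16\alpha^2 c_2(\Delta), 1\}} \le \sum_{k=2}^{\infty} \frac{c_k(\Delta) s^k}{k!} \le \frac{2}{3}\, \frac{c_2(\Delta)}{\max\{16\alpha^2 c_2(\Delta), 1\}}.$$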
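Proof. We only prove the analytic moment case (the proof for the cumulant case is identical). First
let us show that:

\[
\frac{s^2 m_2(\Delta)}{2}\left(1 - \sum_{k=1}^{\infty}\left(s\alpha\sqrt{m_2(\Delta)}\right)^k\right)
\;\le\; \sum_{k=2}^{\infty} \frac{m_k(\Delta)\, s^k}{k!}
\;\le\; \frac{s^2 m_2(\Delta)}{2}\left(1 + \sum_{k=1}^{\infty}\left(s\alpha\sqrt{m_2(\Delta)}\right)^k\right)
\]

We can bound the following sum from $k = 3$ onwards as:

\[
\left|\sum_{k=3}^{\infty} \frac{m_k(\Delta)\, s^k}{k!}\right|
\;\le\; \frac{1}{2}\sum_{k=3}^{\infty} s^k \alpha^{k-2}\, m_2(\Delta)^{k/2}
\;=\; \frac{s^2 m_2(\Delta)}{2}\sum_{k=1}^{\infty}\left(s\alpha\sqrt{m_2(\Delta)}\right)^k
\]

which proves the claim.
For our choice of $s$,

\[
\sum_{k=1}^{\infty}\left(s\alpha\sqrt{m_2(\Delta)}\right)^k
= \sum_{k=1}^{\infty}\left(\min\left\{\tfrac{1}{4},\, \alpha\sqrt{m_2(\Delta)}\right\}\right)^k
\;\le\; \sum_{k=1}^{\infty}\left(\tfrac{1}{4}\right)^k = \frac{1}{3}
\]

Hence, we have:

\[
\sum_{k=2}^{\infty} \frac{m_k(\Delta)\, s^k}{k!}
\;\ge\; \frac{s^2 m_2(\Delta)}{2}\left(1 - \sum_{k=1}^{\infty}\left(s\alpha\sqrt{m_2(\Delta)}\right)^k\right)
\;\ge\; \frac{s^2 m_2(\Delta)}{3}
= \frac{1}{3}\,\frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta),\, 1\}}
\]

Analogously, the upper bound can be proved. □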
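The two-sided bound of Lemma 5.1 can be checked numerically on a small example. The sketch below is an illustration under assumptions that are ours, not the text's: $t$ is Bernoulli with mean $\mu$, so that $\langle\Delta, t\rangle - \mathbb{E}\langle\Delta, t\rangle$ has central moments satisfying $|m_k| \le |\Delta|^{k-2} m_2$, and $\alpha = 1/\sqrt{\mu(1-\mu)}$ is therefore a valid analytic standardized moment. The series $\sum_{k\ge 2} m_k s^k/k!$ equals the central moment generating function minus one, which is evaluated in closed form. The function name is hypothetical.

```python
import math


def lemma_5_1_sandwich(mu, delta):
    """Return (lower, series, upper) from Lemma 5.1 for a Bernoulli(mu) statistic.

    Assumption: alpha = 1/sqrt(mu*(1-mu)) is used as the analytic moment constant.
    """
    m2 = delta ** 2 * mu * (1.0 - mu)                  # second central moment
    alpha = 1.0 / math.sqrt(mu * (1.0 - mu))
    s = min(1.0 / (4.0 * alpha * math.sqrt(m2)), 1.0)  # step size from the lemma
    # sum_{k>=2} m_k s^k / k! = E[exp(s*delta*(t - mu))] - 1, since m_0 = 1, m_1 = 0
    series = ((1.0 - mu) * math.exp(-s * delta * mu)
              + mu * math.exp(s * delta * (1.0 - mu)) - 1.0)
    scale = m2 / max(16.0 * alpha ** 2 * m2, 1.0)
    return scale / 3.0, series, 2.0 * scale / 3.0
```

For example, `lemma_5_1_sandwich(0.5, 1.0)` uses $m_2 = 1/4$, $\alpha = 2$ and $s = 1/4$, and the middle value lies between the two outer ones.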



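The following core lemma leads to the proof of Theorem 3.4.

Lemma 5.2. Let $\alpha$ and $\theta$ be defined as in Theorem 3.4. We have that:

\[
\mathcal{L}(\theta) - \mathcal{L}(\theta^*) \;\ge\; \frac{1}{4}\,
\frac{\|\theta - \theta^*\|^2_{\mathcal{F}^*}}{\max\{16\alpha^2\|\theta - \theta^*\|^2_{\mathcal{F}^*},\, 1\}}
\tag{6}
\]

Furthermore, if $\|\theta - \theta^*\|^2_{\mathcal{F}^*} \le \frac{1}{16\alpha^2}$,

\[
\frac{1}{4}\|\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \mathcal{L}(\theta) - \mathcal{L}(\theta^*)
\;\le\; \frac{3}{4}\|\theta - \theta^*\|^2_{\mathcal{F}^*}
\]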

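Proof. As $s$ is clearly in $[0, 1]$, convexity together with the optimality of $\theta^*$ gives:

\[
\begin{aligned}
\mathcal{L}(\theta) - \mathcal{L}(\theta^*) &= \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \\
&\ge \mathcal{L}(\theta^* + s\Delta) - \mathcal{L}(\theta^*)
\end{aligned}
\]

For the cumulant case, we have that this is lower bounded by
$\frac{1}{3}\,\frac{c_2(\Delta)}{\max\{16\alpha^2 c_2(\Delta),\, 1\}}$ using Lemma 5.1 and Lemma 3.5, which proves (6).
Now consider the analytic moment case. By Lemma 3.5 and Lemma 5.1, we have

\[
\mathcal{L}(\theta) - \mathcal{L}(\theta^*) \;\ge\; \log\left(1 + \frac{1}{3}\,\frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta),\, 1\}}\right)
\]

Now by Jensen's inequality, we know that the fourth standardized moment (the kurtosis) is at least
one, so $\alpha^2 \ge \frac{1}{12}$ (since $\frac{1}{2}\,4!\,\alpha^2 = 12\alpha^2 \ge 1$). This implies that:

\[
\frac{1}{3}\,\frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta),\, 1\}} \;\le\; \frac{1}{48\alpha^2} \;\le\; \frac{1}{4}
\]

since the left-hand side only becomes larger if the max is replaced by either of its arguments. Now
for $0 \le x \le 1/4$, we have that $\log(1 + x) \ge x - x^2 \ge \frac{3}{4}x$. Proceeding,

\[
\log\left(1 + \frac{1}{3}\,\frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta),\, 1\}}\right)
\;\ge\; \frac{1}{4}\,\frac{m_2(\Delta)}{\max\{16\alpha^2 m_2(\Delta),\, 1\}}
\]

which proves (6) (for the analytic moment case).

For the second claim, the precondition implies that the max in (6) will be achieved with the
argument of 1, which directly implies the lower bound. For the upper bound, we can apply Lemma 5.1
with $s = 1$ ($s = 1$ under our precondition), which implies that $\sum_{k=2}^{\infty} \frac{m_k(\Delta)\, s^k}{k!}$ is less than
$\frac{2}{3} m_2(\Delta)$. The claim follows directly for the cumulant case using Lemma 3.5 with $s = 1$. For the
moment case, we use that $\log(1 + x) \le x$. □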
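The elementary inequalities used for the moment case, $\log(1+x) \ge x - x^2 \ge \frac{3}{4}x$ on $[0, 1/4]$ and $\log(1+x) \le x$, can be confirmed on a grid. This is only a spot check under our own choice of grid, not a proof; the function name is hypothetical.

```python
import math


def check_log_bounds(num_points=1001):
    """Check log(1+x) >= x - x^2 >= 0.75*x and log(1+x) <= x on a grid over [0, 1/4]."""
    for i in range(num_points):
        x = 0.25 * i / (num_points - 1)
        lower_ok = math.log1p(x) >= x - x * x >= 0.75 * x - 1e-15  # tolerance for rounding
        upper_ok = math.log1p(x) <= x
        if not (lower_ok and upper_ok):
            return False
    return True
```

The middle inequality is tight at $x = 1/4$, which is why the precondition caps the argument of the logarithm at $1/4$.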



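We are now ready to prove Theorem 3.4.

Proof (of Theorem 3.4). If $\|\theta - \theta^*\|^2_{\mathcal{F}^*} \le \frac{1}{16\alpha^2}$, then the previous lemma implies the claim.
Let us assume the condition on the loss, i.e. $\mathcal{L}(\theta) - \mathcal{L}(\theta^*) \le \frac{1}{65\alpha^2}$. If
$\|\theta - \theta^*\|^2_{\mathcal{F}^*} \le \frac{1}{16\alpha^2}$, then we are done by the previous argument. So let us assume that
$\|\theta - \theta^*\|^2_{\mathcal{F}^*} > \frac{1}{16\alpha^2}$. Hence, $\max\{16\alpha^2 m_2(\Delta),\, 1\} = 16\alpha^2 m_2(\Delta)$. Using (6), we have that
$\frac{1}{64\alpha^2} \le \mathcal{L}(\theta) - \mathcal{L}(\theta^*)$, which is a contradiction. □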

5.2 Proofs for Section 4
5.2.1 Proof of Theorem 4.2

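Throughout, let $\mathcal{L}(\theta) = \mathbb{E}[-\log P(y|\theta)]$. Also, let $T = \mathbb{E}[t]$ and $\hat{T} = \hat{\mathbb{E}}[t]$.

Lemma 5.3. Suppose that the condition $\|\hat{T} - T\|_\infty \le \lambda/2$ holds. Let $\hat\theta$ be a solution to the
$\ell_1$-regularized optimization problem. For all $\theta \in \Theta$, we have:

\[
\mathcal{L}(\hat\theta) - \mathcal{L}(\theta) \;\le\; \frac{\lambda}{2}\|\hat\theta - \theta\|_1 + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1
\tag{7}
\]

Furthermore, suppose that $\theta$ only has support on $S$, then:

\[
\mathcal{L}(\hat\theta) - \mathcal{L}(\theta) \;\le\; \frac{3\lambda}{2}\|\hat\theta_S - \theta\|_1
\tag{8}
\]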



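Proof. Since $\hat\theta$ solves the regularized problem, we have:

\[
-\langle\hat\theta, \hat{T}\rangle + \log Z(\hat\theta) + \lambda\|\hat\theta\|_1
\;\le\; -\langle\theta, \hat{T}\rangle + \log Z(\theta) + \lambda\|\theta\|_1
\]

Hence,

\[
-\langle\hat\theta, T\rangle + \log Z(\hat\theta) + \lambda\|\hat\theta\|_1
\;\le\; \langle\hat\theta - \theta, \hat{T} - T\rangle - \langle\theta, T\rangle + \log Z(\theta) + \lambda\|\theta\|_1
\]

Using this and the condition on $\lambda$, we have

\[
\begin{aligned}
\mathcal{L}(\hat\theta) - \mathcal{L}(\theta)
&\le \langle\hat\theta - \theta, \hat{T} - T\rangle + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1 \\
&\le \|\hat\theta - \theta\|_1\,\|\hat{T} - T\|_\infty + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1 \\
&\le \frac{\lambda}{2}\|\hat\theta - \theta\|_1 + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1
\end{aligned}
\]

which proves the first inequality. Continuing,

\[
\begin{aligned}
\frac{\lambda}{2}\|\hat\theta - \theta\|_1 + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1
&\le \frac{\lambda}{2}\left(\|\hat\theta\|_1 + \|\theta\|_1\right) + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1 \\
&= \frac{3\lambda}{2}\|\theta\|_1 - \frac{\lambda}{2}\|\hat\theta\|_1
\end{aligned}
\]

which proves the next inequality.

For the final claim, using the sparsity assumption on $\theta$, we have:

\[
\begin{aligned}
\mathcal{L}(\hat\theta) - \mathcal{L}(\theta)
&\le \frac{\lambda}{2}\|\hat\theta - \theta\|_1 + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1 \\
&= \frac{\lambda}{2}\|\hat\theta_S - \theta\|_1 + \frac{\lambda}{2}\|\hat\theta_{S^c}\|_1
   + \lambda\left(\|\theta\|_1 - \|\hat\theta_S\|_1\right) - \lambda\|\hat\theta_{S^c}\|_1 \\
&\le \frac{\lambda}{2}\|\hat\theta_S - \theta\|_1 + \lambda\|\hat\theta_S - \theta\|_1 - \frac{\lambda}{2}\|\hat\theta_{S^c}\|_1 \\
&\le \frac{3\lambda}{2}\|\hat\theta_S - \theta\|_1
\end{aligned}
\]

where the second to last step uses the triangle inequality. This completes the proof. □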

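Lemma 5.4. Suppose that the condition $\|\hat{T} - T\|_\infty \le \lambda/2$ holds. Let $\hat\theta$ be a solution to the
$\ell_1$-regularized optimization problem. For any $\theta \in \Theta$ which only has support on $S$ and such that
$\mathcal{L}(\hat\theta) \ge \mathcal{L}(\theta)$, we have:

\[
\|\hat\theta_{S^c}\|_1 \le 3\|\hat\theta_S - \theta\|_1 \tag{9}
\]
\[
\|\hat\theta - \theta\|_1 \le 4\|\hat\theta_S - \theta\|_1 \tag{10}
\]

Proof. By assumption on $\theta$ and (7),

\[
0 \;\le\; \mathcal{L}(\hat\theta) - \mathcal{L}(\theta) \;\le\; \frac{\lambda}{2}\|\hat\theta - \theta\|_1 + \lambda\|\theta\|_1 - \lambda\|\hat\theta\|_1
\]

Dividing by $\lambda$ and adding $\frac{1}{2}\|\hat\theta - \theta\|_1$ to both the left and right sides,

\[
\frac{1}{2}\|\hat\theta - \theta\|_1 \;\le\; \|\hat\theta - \theta\|_1 + \|\theta\|_1 - \|\hat\theta\|_1
\]

For any component $i \notin S$, we have that $|\hat\theta_i - \theta_i| + |\theta_i| - |\hat\theta_i| = 0$. Hence,

\[
\frac{1}{2}\|\hat\theta - \theta\|_1 \;\le\; \|\hat\theta_S - \theta\|_1 + \|\theta\|_1 - \|\hat\theta_S\|_1 \;\le\; 2\|\hat\theta_S - \theta\|_1
\]

where the last step uses the triangle inequality ($\|\theta\|_1 - \|\hat\theta_S\|_1 \le \|\hat\theta_S - \theta\|_1$). This proves (10).
From this,

\[
\frac{1}{2}\|\hat\theta_S - \theta\|_1 + \frac{1}{2}\|\hat\theta_{S^c}\|_1 = \frac{1}{2}\|\hat\theta - \theta\|_1 \;\le\; 2\|\hat\theta_S - \theta\|_1
\]

which proves (9), after rearranging. □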
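The cone inequalities of Lemma 5.4 follow purely from the scalar premise $0 \le \frac{1}{2}\|\hat\theta - \theta\|_1 + \|\theta\|_1 - \|\hat\theta\|_1$ (the displayed chain after dividing by $\lambda$) and the fact that $\theta$ is supported on $S$. The following sketch tests that implication on explicit vectors. The helper names and the tolerance are our own assumptions.

```python
def l1(v):
    """l1 norm of a list of floats."""
    return sum(abs(x) for x in v)


def check_cone(theta_hat, theta, support, tol=1e-12):
    """Check the conclusions of Lemma 5.4 for theta supported on `support`.

    Returns None when the premise 0 <= 0.5*||d||_1 + ||theta||_1 - ||theta_hat||_1
    fails; otherwise returns (bound_9_holds, bound_10_holds).
    """
    diff = [a - b for a, b in zip(theta_hat, theta)]
    if 0.5 * l1(diff) + l1(theta) - l1(theta_hat) < 0:
        return None
    on_support = l1([diff[i] for i in support])
    off_support = l1([x for i, x in enumerate(theta_hat) if i not in support])
    return (off_support <= 3 * on_support + tol,   # inequality (9)
            l1(diff) <= 4 * on_support + tol)      # inequality (10)
```

Whenever the premise holds, both entries of the returned pair should be true; vectors with too much mass off the support fail the premise instead.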

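Now we are ready to prove Theorem 4.2.

Proof (of Theorem 4.2). First, by Lemma 5.3 and the condition on $\lambda$, we see that

\[
\mathcal{L}(\hat\theta) - \mathcal{L}(\theta^*) \;\le\; \frac{1}{65\,\alpha^{*2}}
\]

(note that $\hat\theta$ satisfies the RE precondition, so $\hat\theta - \theta^* \in V$). Hence using Theorem 3.4 we see that

\[
\frac{1}{4}\|\hat\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \mathcal{L}(\hat\theta) - \mathcal{L}(\theta^*)
\]

On the other hand observe that:

\[
\|\hat\theta_S - \theta^*\|_1 \;\le\; \sqrt{s}\,\|\hat\theta_S - \theta^*\|_2 \;\le\; \frac{\sqrt{s}}{\kappa^*_{\min}}\|\hat\theta - \theta^*\|_{\mathcal{F}^*}
\tag{11}
\]

where the last step uses the Restricted Eigenvalue Condition, Assumption 4.1. Now using the above
with (8) we have that

\[
\frac{1}{4}\|\hat\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \mathcal{L}(\hat\theta) - \mathcal{L}(\theta^*)
\;\le\; \frac{3\lambda\sqrt{s}}{2\kappa^*_{\min}}\|\hat\theta - \theta^*\|_{\mathcal{F}^*}
\]

Hence,

\[
\|\hat\theta - \theta^*\|_{\mathcal{F}^*} \;\le\; \frac{6\lambda\sqrt{s}}{\kappa^*_{\min}}
\tag{12}
\]

and so

\[
\frac{1}{4}\|\hat\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \mathcal{L}(\hat\theta) - \mathcal{L}(\theta^*) \;\le\; \frac{9\lambda^2 s}{(\kappa^*_{\min})^2}
\]

which proves the first claim.

Now to conclude the proof, note that by Assumption 4.1,

\[
\kappa^*_{\min}\|\hat\theta_S - \theta^*\|_2 \;\le\; \|\hat\theta - \theta^*\|_{\mathcal{F}^*} \;\le\; \frac{6\lambda\sqrt{s}}{\kappa^*_{\min}}
\]

Hence by (10) we see that

\[
\|\hat\theta - \theta^*\|_1 \;\le\; 4\|\hat\theta_S - \theta^*\|_1 \;\le\; 4\sqrt{s}\,\|\hat\theta_S - \theta^*\|_2 \;\le\; \frac{24\lambda s}{(\kappa^*_{\min})^2}
\]

This concludes the proof. □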



5.2.2 Analytic Standardized Moment for GLM and Sparsity 

In the generalized linear model example in Section 3.2.3 we showed that if the sufficient statistics
are bounded by $B$ and if $\mathcal{F}^*$ has minimum eigenvalue $\lambda_{\min}$, then we can choose $\alpha = B/\lambda_{\min}$.
However, when $\theta^*$ is sparse we see that in both Theorems 4.2 and 4.5 we only care about $\alpha^*$, the
analytic standardized moment/cumulant of the set $V$. Given this, it is clear from the exposition in
the generalized linear model example in Section 3.2.3 that $\alpha^*$ can be bounded by $B/\kappa^*_{\min}$, since
all elements of the set $V$ satisfy Assumption 4.1.

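5.2.3 Proof of Theorem 4.5

Lemma 5.5. (Sparsity of Restricted Set) If the threshold $\tau = \frac{18\lambda}{(\kappa^*_{\min})^2}$, then the size of the
support of any $\theta \in \Theta_\tau$ is at most $2s$.

Proof. First notice that on the set $S$ thresholding could potentially leave all the $s$ coordinates.
On the other hand, notice that if we threshold using $\tau$, then the number of coordinates that remain
unclipped in the set $S^c$ is bounded by $\|\hat\theta_{S^c}\|_1/\tau$. Hence

\[
\left|\{i : |\hat\theta_i| > \tau\}\right| \;\le\; s + \frac{\|\hat\theta_{S^c}\|_1}{\tau}
\]

By (9), (12) and the RE assumption, we have

\[
\|\hat\theta_{S^c}\|_1 \;\le\; 3\|\hat\theta_S - \theta^*\|_1 \;\le\; 3\sqrt{s}\,\|\hat\theta_S - \theta^*\|_2 \;\le\; \frac{18\lambda s}{(\kappa^*_{\min})^2}
\]

Using this we see that

\[
\left|\{i : |\hat\theta_i| > \tau\}\right| \;\le\; s + \frac{18\lambda s}{(\kappa^*_{\min})^2\,\tau}
\]

Plugging in the value of $\tau$ we get the statement of the lemma, since the support size of $\hat\theta^\tau$
upper bounds the support size of any $\theta \in \Theta_\tau$. □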
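The counting step in this proof, that at most $\|\hat\theta_{S^c}\|_1/\tau$ coordinates outside $S$ can survive a threshold at $\tau$, can be illustrated with a small hard-thresholding sketch. The helper names and the example vector are hypothetical and chosen only for illustration.

```python
def hard_threshold(theta, tau):
    """Keep coordinates with |theta_i| > tau and set the rest to zero."""
    return [x if abs(x) > tau else 0.0 for x in theta]


def support_size(theta):
    """Number of nonzero coordinates."""
    return sum(1 for x in theta if x != 0.0)


def off_support_survivors(theta, support, tau):
    """Count coordinates outside `support` that remain unclipped at level tau."""
    return sum(1 for i, x in enumerate(theta) if i not in support and abs(x) > tau)
```

Each surviving off-support coordinate contributes more than $\tau$ to the $\ell_1$ norm of the off-support part, so their number cannot exceed that norm divided by $\tau$.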



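Lemma 5.6. (Bias) Choose $\tau = \frac{18\lambda}{(\kappa^*_{\min})^2}$. Then,

\[
\mathcal{L}(\hat\theta^\tau_S) - \mathcal{L}(\theta^*) \;\le\; \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\]

where $\hat\theta^\tau$ is defined as $\hat\theta^\tau_i = \hat\theta_i\,\mathbf{1}[|\hat\theta_i| > \tau]$.

Proof. Note that

\[
\begin{aligned}
\|\hat\theta^\tau_S - \theta^*\|^2_{\mathcal{F}^*}
&\le (\kappa^*_{\max})^2\,\|\hat\theta^\tau_S - \theta^*\|^2_2 \\
&\le 2(\kappa^*_{\max})^2\left(\|\hat\theta^\tau_S - \hat\theta_S\|^2_2 + \|\hat\theta_S - \theta^*\|^2_2\right) \\
&\le 2(\kappa^*_{\max})^2\left(s\tau^2 + \|\hat\theta_S - \theta^*\|^2_2\right) \\
&\le 2(\kappa^*_{\max})^2\left(s\tau^2 + \frac{36\, s\lambda^2}{(\kappa^*_{\min})^4}\right)
\end{aligned}
\]

where the last step is obtained by applying Theorem 4.2. Substituting for $\tau$,

\[
\|\hat\theta^\tau_S - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \frac{720\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\tag{13}
\]

Now the condition on $\lambda$ implies that Theorem 3.4 is applicable, which completes the proof. □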



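Proof of Theorem 4.5. The first claim of the theorem follows from Lemma 5.5. We prove the second
claim of the theorem by considering two cases. First, when $\mathcal{L}(\tilde\theta) \le \mathcal{L}(\hat\theta^\tau_S)$. In this case by
Lemma 5.6 we have

\[
\mathcal{L}(\tilde\theta) - \mathcal{L}(\theta^*) \;\le\; \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\]

Also by the condition on $\lambda$, applying Theorem 3.4 we see that

\[
\frac{1}{4}\|\tilde\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \mathcal{L}(\tilde\theta) - \mathcal{L}(\theta^*)
\;\le\; \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\]

which gives us the second claim of the theorem. The next case is when $\mathcal{L}(\tilde\theta) > \mathcal{L}(\hat\theta^\tau_S)$. In this
case, by applying Lemma 5.3 with $\theta = \hat\theta^\tau_S$, we see that

\[
\begin{aligned}
\mathcal{L}(\tilde\theta) - \mathcal{L}(\hat\theta^\tau_S)
&\le \frac{3\lambda}{2}\|\hat\theta^\tau_S\|_1
 \;\le\; \frac{3\lambda}{2}\|\hat\theta^\tau_S - \theta^*\|_1 + \frac{3\lambda}{2}\|\theta^*\|_1 \\
&\le \frac{18\sqrt{5}\,\lambda^2 s\,\kappa^*_{\max}}{(\kappa^*_{\min})^3} + \frac{3\lambda}{2}\|\theta^*\|_1
\end{aligned}
\]

where the last step is using (13). Hence we see that

\[
\mathcal{L}(\tilde\theta) - \mathcal{L}(\theta^*) \;\le\; \mathcal{L}(\tilde\theta) - \mathcal{L}(\hat\theta^\tau_S) + \mathcal{L}(\hat\theta^\tau_S) - \mathcal{L}(\theta^*)
\;\le\; \frac{581\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} + \frac{3\lambda}{2}\|\theta^*\|_1
\]

Hence by the condition on $\lambda$ we see that the pre-condition of Theorem 3.4 is satisfied and hence
we see that

\[
\begin{aligned}
\frac{1}{4}\|\tilde\theta - \theta^*\|^2_{\mathcal{F}^*}
&\le \mathcal{L}(\tilde\theta) - \mathcal{L}(\theta^*)
 \;\le\; \mathcal{L}(\tilde\theta) - \mathcal{L}(\hat\theta^\tau_S) + \mathcal{L}(\hat\theta^\tau_S) - \mathcal{L}(\theta^*) \\
&\le \mathcal{L}(\tilde\theta) - \mathcal{L}(\hat\theta^\tau_S) + \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} \\
&\le \frac{3\lambda}{2}\|\tilde\theta - \hat\theta^\tau_S\|_1 + \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} && (14) \\
&\le 6\lambda\|\tilde\theta_S - \hat\theta^\tau_S\|_1 + \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} && (15) \\
&\le 6\lambda\sqrt{s}\,\|\tilde\theta_S - \hat\theta^\tau_S\|_2 + \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} \\
&\le \frac{6\lambda\sqrt{s}}{\kappa^*_{\min}}\|\tilde\theta - \theta^*\|_{\mathcal{F}^*}
   + \frac{6\lambda\sqrt{s}}{\kappa^*_{\min}}\|\hat\theta^\tau_S - \theta^*\|_{\mathcal{F}^*}
   + \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} && (16) \\
&\le \frac{6\lambda\sqrt{s}}{\kappa^*_{\min}}\|\tilde\theta - \theta^*\|_{\mathcal{F}^*}
   + \frac{161\,\kappa^*_{\max}\, s\lambda^2}{(\kappa^*_{\min})^3}
   + \frac{540\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4} && (17)
\end{aligned}
\]

where (14) is obtained by applying Lemma 5.3 with $\Theta = \Theta_\tau$ and (15) is by Lemma 5.4 with $\Theta = \Theta_\tau$.
(16) is by Assumption 4.1 and (17) is due to (13). Simplifying we conclude that

\[
\frac{1}{4}\|\tilde\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \mathcal{L}(\tilde\theta) - \mathcal{L}(\theta^*)
\;\le\; \frac{6\lambda\sqrt{s}}{\kappa^*_{\min}}\|\tilde\theta - \theta^*\|_{\mathcal{F}^*}
 + \frac{701\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\tag{18}
\]

By the inequality that for any $a, b \in \mathbb{R}$, $ab \le \frac{a^2}{2} + \frac{b^2}{2}$, we have

\[
\frac{1}{2}\|\tilde\theta - \theta^*\|^2_{\mathcal{F}^*} \;\le\; \frac{288\,\lambda^2 s}{(\kappa^*_{\min})^2}
 + \frac{2804\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\]

Thus

\[
\|\tilde\theta - \theta^*\|_{\mathcal{F}^*} \;\le\; \frac{24\,\lambda\sqrt{s}}{\kappa^*_{\min}}
 + \frac{75\,\kappa^*_{\max}\,\lambda\sqrt{s}}{(\kappa^*_{\min})^2}
\]

Using this in (18),

\[
\mathcal{L}(\tilde\theta) - \mathcal{L}(\theta^*) \;\le\; \frac{144\,\lambda^2 s}{(\kappa^*_{\min})^2}
 + \frac{450\,\kappa^*_{\max}\,\lambda^2 s}{(\kappa^*_{\min})^3}
 + \frac{701\,(\kappa^*_{\max})^2\, s\lambda^2}{(\kappa^*_{\min})^4}
\]

Simplifying we get the second claim of the theorem for the second case. □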
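The numerical constants in this proof (540, 161, 701, 288, 2804, 24, 75, 144, 450) can be re-derived mechanically. The sketch below does the arithmetic under two assumptions of ours: the threshold is $\tau = 18\lambda/(\kappa^*_{\min})^2$, and terms with different powers of $\kappa^*_{\min}$ are merged using $\kappa^*_{\max} \ge \kappa^*_{\min}$, with non-integer values rounded up. The function name is hypothetical.

```python
import math


def theorem_4_5_constants():
    """Recompute the constants of the Theorem 4.5 proof from the earlier bounds."""
    c13 = 2 * (18 ** 2 + 36)               # bound (13): 2 * (s*tau^2 term + 36 term)
    bias = 0.75 * c13                      # Theorem 3.4 upper bound: (3/4) * (13)
    c17 = math.ceil(6 * math.sqrt(c13))    # cross term in (17): 6 * sqrt(720), rounded up
    c18 = c17 + int(bias)                  # merged constant in (18)
    quad = 4 * 72                          # 6*A*x <= x^2/8 + 72*A^2, then multiply by 4
    half = 4 * c18                         # same scaling applied to the (18) constant
    lin1 = math.sqrt(2 * quad)             # sqrt(a + b) <= sqrt(a) + sqrt(b)
    lin2 = math.ceil(math.sqrt(2 * half))
    return {"c13": c13, "bias": bias, "c17": c17, "c18": c18, "quad": quad,
            "half": half, "lin1": lin1, "lin2": lin2,
            "final1": 6 * lin1, "final2": 6 * lin2}
```

The returned values are intended to match the constants in the displayed bounds, with `final1` and `final2` giving the first two coefficients of the final inequality.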